AITopics | speech encoder

speech encoder

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

344ef5151be171062f42f03e69663ecf-Supplemental.pdf

Neural Information Processing Systems | Apr-25-2026, 10:30:40 GMT

artificial intelligence, machine learning, speech-t, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.48)

DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

Fang, Qingkai, Zhou, Yan, Feng, Yang

Neural Information Processing Systems | Feb-17-2026, 16:32:38 GMT

In this paper, we propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.

machine learning, natural language, translation, (19 more...)

Neural Information Processing Systems

Country:

Europe > Austria > Vienna (0.14)
Asia > South Korea > Incheon > Incheon (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
(12 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Neural Information Processing Systems | Feb-9-2026, 15:35:56 GMT

Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from CommonCrawl.
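
As a rough illustration of similarity-based mining, the sketch below pairs each audio embedding with its nearest text embedding and keeps the pair only if the score clears a threshold. Cosine similarity, the threshold value, and the names `cosine` and `mine_pairs` are assumptions made for this example; the summary above does not say which metric the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(audio_embs, text_embs, threshold=0.8):
    """Return (audio_index, text_index, score) for each audio embedding whose
    best-matching text embedding scores at or above the threshold."""
    pairs = []
    for i, a in enumerate(audio_embs):
        scores = [cosine(a, t) for t in text_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy 3-dimensional embeddings (hypothetical values, not real encoder output).
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
text = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.9], [-1.0, 0.0, 0.0]]
mined = mine_pairs(audio, text, threshold=0.9)
```

Here the first two audio vectors find a close text match and the third is rejected by the threshold. The exhaustive inner loop is only workable for toy data; at the scale of billions of sentences, an approximate nearest-neighbour index would be needed instead.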

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain (0.04)
Europe > Austria > Styria > Graz (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

Mdhaffar, Salima, Elleuch, Haroun, Chellaf, Chaimae, Nguyen, Ha, Estève, Yannick

arXiv.org Artificial Intelligence | Dec-10-2025

Abstract: This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.

Speech foundation models based on self-supervised learning (SSL) have brought significant advances in speech processing. These models, such as wav2vec 2.0 [1], HuBERT [2], and WavLM [3], generate learned speech representations that can be applied to a wide range of downstream speech processing tasks. By training on large amounts of unlabelled speech data, SSL models have demonstrated the ability to capture crucial speech features, such as phonemes and other acoustic units [4]. This capability has led to significant progress in multiple downstream tasks, including speech recognition [1], speech translation [5], speech separation, speaker verification, speaker diarization [3], and emotion detection [6]. Different approaches have been proposed to pretrain models by aligning speech and text, such as mSLAM [7], a massively multilingual joint pre-training approach for speech and text.
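
The teacher-student alignment described in the abstract above can be sketched as follows: frame-level outputs of the speech encoder are pooled into one utterance vector, and the loss pushes that vector toward the frozen text encoder's embedding of the same utterance. Mean pooling, the cosine-distance loss, and the names `mean_pool` and `alignment_loss` are assumptions for illustration; the abstract does not state which pooling or loss SENSE uses.

```python
import math


def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def alignment_loss(speech_frames, teacher_text_emb):
    """Cosine distance (1 - cosine similarity) between the pooled student
    speech embedding and the teacher text embedding."""
    student = mean_pool(speech_frames)
    dot = sum(s * t for s, t in zip(student, teacher_text_emb))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher_text_emb))
    return 1.0 - dot / (norm_s * norm_t)


# Hypothetical toy values: 3 frames of a 2-dimensional speech encoder output.
frames = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
aligned = alignment_loss(frames, [5.0, 0.0])     # same direction as the teacher
orthogonal = alignment_loss(frames, [0.0, 1.0])  # unrelated direction
```

The loss is near zero when the pooled speech vector points in the same direction as the teacher embedding and grows as the two diverge, which is the signal a student encoder would be trained to minimise.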

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.12093

Country: Africa (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Gao, Yan, Yang, Yazheng, Lan, Zhibin, Chen, Yidong, Zhang, Min, Wei, Daimeng, Huang, Hui, Su, Jinsong

arXiv.org Artificial Intelligence | Nov-17-2025

Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into a target language text, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.
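
To make the routing idea more concrete, here is a minimal sketch of the gating step of a mixture-of-experts projector, together with a load-balancing term. It uses top-1 routing and a Switch-Transformer-style balance term, which are swapped in for illustration and are not necessarily the paper's language-specific or intra-group losses. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(gate_logits):
    """Top-1 routing: return per-token gate probabilities and chosen expert."""
    probs = [softmax(row) for row in gate_logits]
    choice = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, choice


def load_balance_loss(probs, choice):
    """Balance term: num_experts * sum_i(f_i * p_i), where f_i is the fraction
    of tokens routed to expert i and p_i is its mean gate probability.
    It equals 1.0 when both f and p are uniform across experts."""
    n_tokens, n_experts = len(probs), len(probs[0])
    frac = [choice.count(i) / n_tokens for i in range(n_experts)]
    mean_p = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_p))


# Hypothetical gate logits for 4 speech tokens and 2 language experts.
logits = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
probs, choice = route_tokens(logits)
balanced = load_balance_loss(probs, choice)
collapsed = load_balance_loss(*route_tokens([[2.0, 0.0]] * 4))
```

When tokens alternate between the two experts the term sits at its minimum, and it rises when every token collapses onto a single expert, which is the behaviour a balancing loss is meant to penalise.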

artificial intelligence, machine translation, natural language, (13 more...)

arXiv.org Artificial Intelligence

2511.1067

Country: Asia > China (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen

arXiv.org Artificial Intelligence | Nov-14-2025

Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.0969

Country:

Africa (1.00)
North America > United States (0.67)
Asia > Indonesia (0.67)
(3 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine (1.00)
Education (0.67)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

Feng, Pengchao, Ma, Ziyang, Chen, Wenxi, Li, Yao, Wang, Sheng, Yu, Kai, Chen, Xie

arXiv.org Artificial Intelligence | Nov-12-2025

End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
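
The retrieval step described above, going directly from a speech query to text passages, can be pictured as a top-k search in a shared embedding space. The sketch below assumes the speech query and the passages have already been embedded into that space and ranks passages by dot product. The function name `retrieve_top_k`, the scoring function, and the toy vectors are hypothetical and not taken from the paper.

```python
import heapq


def retrieve_top_k(query_emb, passage_embs, passages, k=2):
    """Rank text passages by dot product with the speech-query embedding."""
    scored = (
        (sum(q * p for q, p in zip(query_emb, emb)), text)
        for emb, text in zip(passage_embs, passages)
    )
    return [text for _, text in heapq.nlargest(k, scored, key=lambda st: st[0])]


# Hypothetical embeddings; a real system would obtain them from a speech
# encoder (query) and a text encoder (passages) trained into a shared space.
passages = ["opening hours", "ticket prices", "parking rules"]
passage_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
speech_query_emb = [0.9, 0.3]
top = retrieve_top_k(speech_query_emb, passage_embs, passages, k=2)
```

Because the query is embedded straight from audio, no intermediate transcription is needed before retrieval, which is the property the abstract attributes to the end-to-end design.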

arxiv preprint arxiv, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2505.00028

Country:

Asia > China (0.15)
Asia > Middle East > Republic of Türkiye (0.14)
North America > United States (0.14)

Genre: Research Report > New Finding (0.48)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Zhang, Binbin, Liang, Chengdong, Wang, Shuai, Geng, Xuelong, Guo, Zhao, Li, Haoyu, Yin, Hao, Yang, Xipeng, Zhang, Pengshen, Ma, Changwei, Xie, Lei

arXiv.org Artificial Intelligence | Oct-30-2025

In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/

arxiv preprint arxiv, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2509.19902

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations

Ghannam, Ahmad, Alharthi, Naif, Alasmary, Faris, Tabash, Kholood Al, Sadah, Shouq, Ghouti, Lahouari

arXiv.org Artificial Intelligence | Oct-29-2025

In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
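
The early-fusion arithmetic in the abstract (1500 encoder frames averaged in groups of 10 to give 150 speech tokens, followed by a linear projection) and the random deactivation of speech during training can be sketched as below. The feature dimensions, the identity-like weight matrix, and the choice to represent a deactivated speech input as an empty token list are assumptions for illustration; the function names are hypothetical.

```python
import random


def average_frames(frames, window=10):
    """Average each run of `window` consecutive frames into one token."""
    if len(frames) % window != 0:
        raise ValueError("frame count must be a multiple of the window size")
    tokens = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        tokens.append([sum(col) / window for col in zip(*chunk)])
    return tokens


def project(tokens, weight):
    """Linear projection: each token (length d_in) times weight (d_in x d_out)."""
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]


def maybe_drop_speech(tokens, drop_prob, rng):
    """Randomly deactivate the speech input by returning no speech tokens."""
    return [] if rng.random() < drop_prob else tokens


# Hypothetical sizes: 1500 frames of dimension 4, projected to dimension 3.
frames = [[float(i)] * 4 for i in range(1500)]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
speech_tokens = project(average_frames(frames), weight)
kept = maybe_drop_speech(speech_tokens, drop_prob=0.0, rng=random.Random(0))
dropped = maybe_drop_speech(speech_tokens, drop_prob=1.0, rng=random.Random(0))
```

In a full model the projected speech tokens would be merged with the text tokens before the text encoder; that merging step is left out here because the abstract does not describe how it is done.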

computational linguistic, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2510.24247

Country:

Europe (0.94)
Asia > Middle East > Saudi Arabia (0.14)
North America > Mexico > Mexico City (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
